Loading Data Tutorial

MLDB operates on data via Datasets, which can be created and populated in three different ways:

  1. You can create a mutable Dataset and insert data row by row via REST.
  2. You can create a Dataset from an existing file.
  3. You can create a Dataset by running a Procedure.

Creating a dataset via REST by POSTing rows

Creating a Dataset is a simple REST call. The notebook cells below use pymldb's Connection class to make REST API calls. You can check out the Using pymldb Tutorial for more details.


In [11]:
from pymldb import Connection
mldb = Connection()

Let's create a dataset called example.


In [12]:
mldb.put('/v1/datasets/example', { "type":"sparse.mutable" })


Out[12]:
PUT http://localhost/v1/datasets/example
201 Created
{
  "status": {
    "columnCount": 0, 
    "rowCount": 0, 
    "valueCount": 0
  }, 
  "config": {
    "type": "sparse.mutable", 
    "id": "example"
  }, 
  "state": "ok", 
  "type": "sparse.mutable", 
  "id": "example"
}

That's all there is to it, and now we can add some rows.


In [13]:
mldb.post('/v1/datasets/example/rows', {
    "rowName": "first row",
    "columns": [
        ["first column", 1, 0],
        ["second column", 2, 0]
    ]
})

mldb.post('/v1/datasets/example/rows', {
    "rowName": "second row",
    "columns": [
        ["first column", 3, 0],
        ["second column", 4, 0]
    ]
})
mldb.post("/v1/datasets/example/commit")


Out[13]:
POST http://localhost/v1/datasets/example/commit
200 OK

So now we have a little bit of data in our dataset. Let's check.


In [14]:
mldb.query("select * from example")


Out[14]:
first column second column
_rowName
first row 1 2
second row 3 4

Creating a Dataset from a file

Now we'll load a dataset from a CSV file. Let's start by creating a simple CSV file to import:


In [15]:
%mkdir /mldb_data/datasets


mkdir: cannot create directory ‘/mldb_data/datasets’: File exists

In [16]:
%%writefile /mldb_data/datasets/sample.csv
a column,another column
a,b
c,d


Overwriting /mldb_data/datasets/sample.csv

We can import this dataset with a simple procedure.


In [17]:
mldb.put('/v1/procedures/import_example2', {
    "type":"import.text",
    "params": {
        "dataFileUrl":"file:///mldb_data/datasets/sample.csv",
        "outputDataset": "example2",
        "runOnCreation": True
    }
})


Out[17]:
PUT http://localhost/v1/procedures/import_example2
201 Created
{
  "status": {
    "firstRun": {
      "runStarted": "2016-03-22T16:20:51.3476582Z", 
      "status": {
        "numLineErrors": 0
      }, 
      "runFinished": "2016-03-22T16:20:51.3526092Z", 
      "id": "2016-03-22T16:20:51.347551Z-5bc7042b732cb41f", 
      "state": "finished"
    }
  }, 
  "config": {
    "params": {
      "outputDataset": "example2", 
      "runOnCreation": true, 
      "dataFileUrl": "file:///mldb_data/datasets/sample.csv"
    }, 
    "type": "import.text", 
    "id": "import_example2"
  }, 
  "state": "ok", 
  "type": "import.text", 
  "id": "import_example2"
}

And a query to validate that things got loaded correctly!


In [18]:
mldb.query("select * from example2")


Out[18]:
a column another column
_rowName
2 a b
3 c d

Creating a Dataset by running a Procedure on another Dataset

Procedures take Datasets as inputs and can create Datasets as outputs. This is how you can do data cleanup/transformation in MLDB. Here's a simple example with the transform Procedure:


In [19]:
mldb.put('/v1/procedures/example', {
    "type": "transform",
    "params": {
        "inputData": 'select "first column" + "second column" as "transformed column" from example',
        "outputDataset": "example3"
    }
})
mldb.post('/v1/procedures/example/runs')


Out[19]:
POST http://localhost/v1/procedures/example/runs
201 Created
{
  "status": {
    "columnCount": 1, 
    "rowCount": 2, 
    "valueCount": 3
  }, 
  "runStarted": "2016-03-22T16:20:51.3896525Z", 
  "state": "finished", 
  "runFinished": "2016-03-22T16:20:51.3908167Z", 
  "id": "2016-03-22T16:20:51.389521Z-5bc7042b732cb41f"
}

In [20]:
mldb.query("select * from example3")


Out[20]:
transformed column
_rowName
first row 3
second row 7

Where to next?

Check out the other Tutorials and Demos.


In [ ]: